Abstract. An experimental unit is an opportunity to randomly apply or withhold a treatment.
There is interference between units if the application of the treatment to one unit may also affect 
other units. In cognitive neuroscience, a common form of experiment presents a sequence of 
stimuli or requests for cognitive activity at random to each experimental subject and measures 
biological aspects of brain activity that follow these requests. Each subject is then many 
experimental units, and interference between units within an experimental subject is likely, in 
part because the stimuli follow one another quickly and in part because human subjects learn 
or become experienced or primed or bored as the experiment proceeds. We use a recent fMRI 
experiment concerned with the inhibition of motor activity to illustrate and further develop 
recently proposed methodology for inference in the presence of interference. A simulation
evaluates the power of competing procedures.
1 Introduction: An Application of Inference with Interference
1.1 What is interference between units? 

If treatment effects are defined as comparisons of the two potential responses that an
experimental unit would exhibit under treatment or under control (Neyman 1923, Welch
1937, Rubin 1974, Lindquist and Sobel 2011), then an implicit premise of this definition
is "no interference between units," as discussed by Cox (1958, p. 19): "There is no
'interference' between different units if the observation on one unit [is] unaffected by the
particular assignment of treatments to the other units;" see also Rubin (1986). For
instance, widespread use of a vaccine may benefit unvaccinated individuals because they are
less likely to encounter an infected individual, a form of interference known as herd
immunity; see Hudgens and Halloran (2008). In agriculture, the treatment applied to one
plot may also affect adjacent plots; see David and Kempton (1996). In social experiments,
people talk, and changing the treatment applied to one person may change what she says
to someone else, altering his response to treatment; see Sobel (2006).

In some contexts, interference is of central interest in itself — this can be true of 
herd immunity or of social interaction, for example — but in many if not most contexts, 
interference is principally an inconvenience, depriving us of both independent observations 
and a familiar definition of treatment effects. We apply and extend a recent, general 
approach to inference with interference (Rosenbaum 2007a) in the context of a cognitive 
neuroscience experiment in which the brains of a moderate number of subjects are studied 
using fMRI while faced with a rapid fire sequence of randomized stimuli. In this context, 
interference is likely to be widespread and difficult to model with precision. The goal 
is a simple, sturdy, valid method of inference whose conclusions about the magnitude of 
treatment effects are intelligible when the interference may be complex in form. 






1.2 Three themes: randomization inference, confidence intervals with interference, ineffective trials

In this case-study, we reanalyze a randomized experiment in cognitive neuroscience with a 
view to illustrating three ideas, one very old idea, one somewhat new idea, and one idea 
that has evolved gradually over more than half a century. In many cognitive neuroscience 
experiments, a moderate number of subjects are repeatedly exposed to many randomly 
selected stimuli intended to elicit cognitive activity of a specific type together with its 
characteristic neurological activity visible with, say, fMRI. Three dilemmas arise in these 
experiments. First, because a few thoughtful, complex human subjects are observed 
many times performing simple repetitive tasks, subjects become familiar with the tasks, 
perhaps increasingly bored or skillful or distracted or fatigued or aware of the purpose 
of the experiment, so the situation is unlike a study of a single response elicited from 
each of many separated, unrelated subjects, and also unlike a stationary time series or a 
repeated measures model with dependence within subjects represented by additive subject 
parameters. In such a context, one might wish to draw inferences about treatment effects 
on many brain regions without relying on a model fitted to just a few people. Second, 
for reasons both biological and cognitive, rapid-fire stimuli are likely to interfere with 
one another, in part because the neurological response to one stimulus is expected to 
last well beyond the presentation of the next stimulus, and in part because learning and 
boredom and surprise are global cognitive responses to long segments of a sequence of
stimuli, not responses to a single stimulus. If 100 treatment/control tasks are presented to
one subject in ten minutes, then it is unrealistic to characterize the effect of the treatment 
versus control in terms of response to single trials, because the response to each trial 
is affected by many previous trials. We need to characterize the differing responses to 
treated and control stimuli without assuming the mind and brain are born anew after each 
stimulus. Third, the experimenter controls the stimuli, the requests for cognitive activity,
but requests for cognitive activity may not produce the requested activity, and hence not 
produce the neurological activity characteristic of that cognitive activity. This is familiar 
from conversation, as when a speaker asks a question only to receive the reply: "Would you 
repeat that? I wasn't listening." If a statistical test is used that expects every stimulus 
to elicit its intended cognitive activity, the test may have much less power to detect actual 
activity than a test which acknowledges distraction and boredom and error in addition to 
the requested activity. 

The old idea, due to Sir Ronald Fisher (1935), is that randomization can form the
"reasoned basis for inference" in randomized experiments, creating without modeling
assumptions all of the probability distributions needed to test the null hypothesis of no
treatment effect. In his introduction of randomized experimentation, Fisher (1935,
Chapter 2) pointedly used a single-subject randomized experiment, the famed lady tasting tea,
precisely because modeling and sampling assumptions seemed so inadequate to describe
a single-subject experiment. In particular, there was no need to model the lady's evolving
cognitive activity to test the null hypothesis that she could not discern whether milk or
tea had first been added to the cup. Although it sometimes receives less emphasis in
the statistics curriculum of 2011, Fisher's theory of randomization inference was viewed as
one of the field's celebrated results. This method of using random assignment to replace
modeling assumptions was described by Jerzy Neyman (1942, p. 311) as "a very brilliant
one due to Fisher," and in retrospect Neyman (1967, p. 1459) wrote: "Without
randomization there is no guarantee that the experimental data will be free from a bias that no
test of significance can detect." In a similar vein, John Tukey (1986, p. 72) recommended:
"using randomization to ensure validity — leaving to assumptions the task of helping with 
stringency." (Stringency is decent power in difficult situations, in the spirit of the formal 
notion of a most stringent test which minimizes the maximum power loss over a class of
alternatives.) 

The newer idea addresses a limitation of Fisher's method when used in the presence 
of interference. Fisher's method yields a valid test of the null hypothesis of no effect. 
If a treatment effect has a simple form, say an additive constant effect or shift, then it is 
possible to invert Fisher's test of no effect to yield a confidence statement for the magnitude 
of this constant effect (e.g., Lehmann 1975); however, by its nature, interference precludes 
such a simple form for an effect. The newer idea is to invert the randomization test of no 
effect to yield a confidence interval for an attributable effect in the presence of interference 
that contrasts the results seen with an active treatment to the results that would have 
been seen in an experiment of identical design but with no active treatment, a so-called 
"uniformity trial" common in the early years of randomized agricultural experimentation 
(Rosenbaum 2007a). This newer idea is applicable with distribution-free statistics whose
distribution in the uniformity trial is known without conducting the uniformity trial. The 
classes of distribution-free statistics and of rank statistics overlap substantially but are not 
the same, and it is the distribution-free property that is needed here. 

The third, gradually evolving idea made a first appearance in a paper by Lehmann 
(1953) concerned with the power of rank tests. After showing that Wilcoxon's test was 
the locally most powerful rank test for a constant, additive effect in the presence of logistic 
errors, Lehmann went on to show that it was also locally most powerful against a very 
specific mixture alternative in which only a fraction of subjects respond to treatment. 
Conover and Salsburg (1988) generalized the mixture alternative and derived the form 
of the corresponding locally most powerful test; this was no longer Wilcoxon's test, but 
rather a test that gave greater emphasis to larger responses. Although they substantially 
increase power when some trials fail to elicit the intended effect, the ranks used by Conover 
and Salsburg have no obvious interpretation, so they cannot be used as the basis for an
attributable effect. It turns out, however, that Conover and Salsburg's ranks are almost
the same as ranks proposed by Stephenson (1981); see also related work by Deshpande and
Kochar (1980) and Stephenson and Ghosh (1985). Using Stephenson's ranks, a confidence
interval for attributable effects becomes available (Rosenbaum 2007b), thereby permitting 
inference about the magnitude of the effect in the presence of interference. 

Although the main goal is to illustrate these three ideas in the context of an fMRI 
experiment, along the way a technical issue arises. The experiment is not a perfectly 
balanced design, and for unbalanced designs the attributable effect has a more natural 
interpretation if it is not formulated in terms of a linear rank statistic, but rather in terms 
of a linear placement statistic in the sense of Orban and Wolfe (1982), which is a form 
of nonlinear rank statistic. Because perfect balance is difficult to achieve in cognitive 
neuroscience experiments, we develop the formalities in terms of the placement statistics 
that are most likely to be useful in practice. 

1.3 A randomized experiment in the cognitive neuroscience of motor inhibition 

In the experiment by Duann, Ide, Luo and Li (2009), each of 58 experimental subjects 
was observed in four experimental sessions that were each about ten minutes in length. 
At random times during a session, a trial began, with a median of 97 trials per session. 
With probability 3/4, the trial was a "go trial": a dot was presented on a screen, and after an
interval of time of random length, the dot became a circle signifying that the subject was
to quickly press a button. With probability 1/4, the trial was a "stop trial": the trial began
as a go trial, but shortly after the circle appeared it was replaced by an X signifying "do
not press the button." In a stop trial, the subject is instructed to do something, and then
the instruction is cancelled. Here, an experimental unit is a trial, with stop trials called
'treatment' and go trials called 'control.' During an experimental session, brain activity
was recorded using fMRI at two second intervals. The experiment sought to determine 
how the brain reacted differently to go and stop trials, where stop trials call for inhibition 
of a previously requested motor response from the subject. 

Figure 1 shows one session for one subject. The vertical grey lines are go trials. 
The vertical black lines are stop trials. Based on fMRI, Figure 1 shows activity in the 
subthalamic nucleus (STN). Aron and Poldrack (2006) and Li et al. (2008) suggested that 
the STN plays an important role in response inhibition. In the lower portion of Figure 1, 
the STN activity is filtered without use of the stop/go distinction. The filter is a high-pass 
filter of 128s: it removes slow, low frequency drifts, leaving behind the high frequency ups 
and downs thought to reflect brain activity. The effects of the filter are somewhat visible 
in Figure 1. For the STN, we analyze unfiltered and filtered data in parallel, obtaining 
similar conclusions. In effect, the experiment produces 58 x 4 = 232 figures analogous to 
Figure 1, one for each subject in each session, and does this for many regions of the brain. 

The assumption of "no interference between units" is not at all plausible in Figure 1. 
A typical session has about a hundred trials or experimental units in about 600 seconds. 
There is interference if the response of a subject at a given trial is affected by treatments at 
other trials. Interference is likely for at least two reasons. First, the brain has a measurable 
response to a stimulus for many seconds after the stimulus has been withdrawn, so a subject 
is still responding to one trial when the next trial begins. Second, the response to a stop 
trial preceded by a long string of go trials may be different from the response to a stop trial 
preceded by another stop trial. In addition, as time goes by, subjects are experiencing 
the normal responses people have when performing a repetitive task: they become familiar 
with the task, or bored by the task, or less distracted by the recording equipment and more 
focused on the task, or distracted by something else, and each of these is a response to their 
entire past experience. The 22,440 trials in this experiment are nothing like a randomized
clinical trial with 22,440 unrelated people who do not interfere with each other. It is, 
nonetheless, a randomized trial and randomization can form the basis for inference, as it 
did in Fisher's (1935, §2) prototype trial of one lady tasting eight cups of tea. 

The trial has a second important feature. Not all trials are "successful." In the 
first instance, in a stop trial, the subject is instructed first to "go" (press the button)
and then the instruction is cancelled. In a stop trial, if the random time between
the circle and the X is longer than usual and the subject is quicker than most, then she 
may press the button before the instruction is cancelled. In this case, even though the 
trial is randomized to be a stop or treated trial, her brain should exhibit the response 
typical under the control condition, because nothing she experienced distinguished the 
trial from a go trial. In addition to the situation just described, it may also happen 
that the subject is unambiguously told to press the button but does not do so, or is 
unambiguously told not to press the button but does so anyway, perhaps because the 
subject is momentarily distracted. Also, a subject may exhibit correct behavior with 
erroneous thoughts, say failing to press the button because of distraction or fatigue rather 
than inhibition. Expressed differently, whether or not a trial is successful is not generally 
a visible property of the trial, yet we are confident that human subjects do not always 
think the thoughts an experimenter requests. If a trial is not successful in any of these 
senses, then the requested cognitive activity may not take place, so there may not be 
the change in blood oxygenation that would typically accompany the requested cognitive 
activity. Although a stimulus asks for a cognition, we cannot tell whether the cognition 
took place or not, because we see only behavior and neurological response, but it is unlikely 
that every stimulus elicits its intended cognition. We might think of responses as a mixture 
of successful and unsuccessful trials, where successful trials produce a specific pattern of 
fMRI response. Salsburg (1986), Conover and Salsburg (1988) and Rosenbaum (2007b)
consider rank tests that are particularly effective when only a subset of experimental units 
respond to treatment. These rank tests score the ranks in such a way that little weight 
is given to lower ranks. In the current paper, a similar approach is taken in studies with 
interference between units. 

When a region of the brain is stimulated to activity, the change in blood oxygenation 
measured by fMRI is not immediate. There is a brief delay, perhaps a dip, for about 2
seconds, followed by a sharp rise, a sharp fall to slightly below baseline, followed by a gradual
return to baseline; see Lindquist (2008, Figure 3). This curve is known as the hemodynamic
response function (HRF). We use the form developed by Friston et al. (1998), specifically
a weighted difference of two gamma densities, $\gamma(x; \omega, \vartheta) = \vartheta^{\omega} x^{\omega - 1} \exp(-\vartheta x) / \Gamma(\omega)$, both
with parameter $\vartheta = 1/16$, and with shape parameters $\omega_1 = 6$ and $\omega_2 = 16$, specifically the
function $\mathrm{hrf}(x) = \gamma(16x; 6, 1/16) - \gamma(16x; 16, 1/16)/6$ where $x$ is in seconds. Although
we do not report these results, we tried a second form for the HRF with a similar shape 
but built from inverse logit functions (Lindquist and Wager 2007), obtaining qualitatively 
similar results in a table parallel to Table 2. 

Recall that the measurements in Figure 1 occur at two second intervals. Evaluating the 
hemodynamic response function, hrf (x), at two-second intervals, we computed 17 weights 
for 17 two-second intervals that follow each trial, that is, for the 2 x 17 = 34 seconds 
that follow a trial. These weights sum to one. The first weight is zero, the third and 
fourth weights are the largest (.375 and .385), and beginning at the eighth the weights 
turn slightly negative (weight eight is -0.031) gradually returning to zero (weight 17 is
-0.0001). At the end of a session, if fewer than 17 two-second intervals remained, we used
the remaining intervals and renormalized the weights so that they again summed to one. 
For a region such as STN in Figure 1, after each trial, we computed the sum of the HRF 
weights multiplied by the fMRI measurements. If a region of a subject's brain has become
unusually active, we expect this weighted average to become unusually large. Figure 2 is 
a pair of boxplots of this weighted average for the one session and one subject in Figure 
1. In Figure 2, there is some indication that the responses in the stop or treated trials in 
Figure 1 are elevated. 
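
The weighting just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names (gamma_density, hrf, hrf_weights, hrf_score) are hypothetical, the first weight is assumed to align with the scan at the index passed as `start`, and the signal is assumed to be a plain list of fMRI measurements taken at two-second intervals.

```python
import math


def gamma_density(x, shape, rate):
    """Gamma density with the given shape and rate, evaluated at x >= 0."""
    return rate ** shape * x ** (shape - 1) * math.exp(-rate * x) / math.gamma(shape)


def hrf(x):
    """Difference-of-gammas HRF from the text; x is time in seconds."""
    return gamma_density(16 * x, 6, 1 / 16) - gamma_density(16 * x, 16, 1 / 16) / 6


def hrf_weights(n_intervals=17, spacing=2.0):
    """Evaluate hrf at 0, 2, 4, ... seconds and scale the values to sum to one."""
    raw = [hrf(spacing * j) for j in range(n_intervals)]
    total = sum(raw)
    return [value / total for value in raw]


def hrf_score(signal, start, weights):
    """Weighted sum of the scans from index `start` onward.

    If the session ends before all weights are used, the weights that
    remain are renormalized so that they again sum to one.
    """
    window = signal[start:start + len(weights)]
    used = weights[:len(window)]
    total = sum(used)
    if total == 0:
        raise ValueError("no usable scans after this trial")
    return sum(w * s for w, s in zip(used, window)) / total


weights = hrf_weights()
```

Under these assumptions the third and fourth entries of `weights` come out close to the .375 and .385 quoted above, and a constant signal is returned unchanged by `hrf_score`, including near the end of a session where the weights are renormalized.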

Although the analysis uses the responses in Figure 2 weighted by the hrf (x) function, 
the method is applicable with any method of scoring the trials that produces one number 
per trial. For instance, a response that is sometimes used is the correlation between the 
hrf (x) function and the sequence of responses that immediately follow a trial. 

1.4 Outline 

Section 2 reviews notation from Rosenbaum (2007a) for treatment effects when interference
may be present. In §3.1, a nonlinear rank statistic $T_{\mathbf{Z}}$ is proposed for a randomized block
design with blocks of unequal block sizes; in particular, $T_{\mathbf{Z}}$ is intended to perform well
when not all treated trials are successful in eliciting the intended cognitive activity. Under
the null hypothesis of no treatment effect there is, of necessity, no interference among the
treatment effects, and §3.2 uses ideas from Orban and Wolfe (1982) to obtain the null
randomization distribution of the test statistic $T_{\mathbf{Z}}$. A confidence statement about the
magnitude of effect with interference is then obtained by a pivotal argument in §3.3; it
measures the magnitude of the difference between the actual trial and the uniformity trial.
In a later section, the method of §3.3 is applied to activation of the subthalamic nucleus in
the experiment of §1.3. A simulation then evaluates the power of $T_{\mathbf{Z}}$ in experiments with
interference using a mixture model in which not all trials elicit the intended cognitive activity.
2 Notation: The Randomized Trial and the Uniformity Trial
2.1 Blocked randomized trial with interference between units 

There are $B \geq 1$ blocks, $b = 1, \ldots, B$, and $N_b \geq 2$ units $bi$ in block $b$, $i = 1, \ldots, N_b$, with
$N = \sum N_b$ units in total. In §1.3, there are $B = 58 \times 4 = 232$ blocks consisting of the four
sessions for each of 58 subjects, $N = 22{,}440$ units or trials in total, with $87 \leq N_b \leq 104$ and
a median $N_b$ of 97. In block $b$, $n_b$ units are picked at random for treatment, $1 \leq n_b < N_b$, and
the remaining $m_b = N_b - n_b \geq 1$ units receive control. In §1.3, the probability that a unit
was a "stop trial" was 1/4, resulting in $13 \leq n_b \leq 37$ and a median $n_b$ of 24. If unit $i$ in
block $b$ was assigned to treatment, write $Z_{bi} = 1$, and if this unit was assigned to control
write $Z_{bi} = 0$, and let $\mathbf{Z} = (Z_{11}, Z_{12}, \ldots, Z_{B,N_B})^T$ be the $N$-dimensional vector in the
lexical order. For a finite set $A$, write $|A|$ for the number of elements of $A$. Write $\Omega$
for the set containing the $|\Omega| = \prod_{b=1}^{B} \binom{N_b}{n_b}$ possible values $\mathbf{z}$ of $\mathbf{Z}$, so $\mathbf{z} \in \Omega$ if and only
if $z_{bi} = 0$ or $z_{bi} = 1$ and $n_b = \sum_{i=1}^{N_b} z_{bi}$ for $b = 1, \ldots, B$. Write $\mathbf{n} = (n_1, \ldots, n_B)^T$ and
$\mathbf{m} = (m_1, \ldots, m_B)^T$.

Write $r_{bi\mathbf{z}}$ for the response that the $i$th unit in block $b$ would have if the treatment
assignment $\mathbf{Z}$ equalled $\mathbf{z}$ for $\mathbf{z} \in \Omega$. In §1.3, for trial $i$ of subject/session $b$, the response
$r_{bi\mathbf{z}}$ is the HRF weighting of either the unfiltered or filtered activity in the subthalamic
nucleus (STN). Each unit has $|\Omega|$ potential responses, only one of which is observed,
namely $r_{bi\mathbf{Z}}$. Figure 2 plots $r_{bi\mathbf{Z}}$ for $Z_{bi} = 0$ and $Z_{bi} = 1$ for one $b$ and $i = 1, \ldots, N_b$.

Unlike the notation of Neyman (1923) and Rubin (1974), the response of the ith unit 
in block b may depend on the treatments Z assigned to all the units; that is, this notation 
permits interference (Rosenbaum 2007a). In §1.3, it is quite plausible that a previous
treatment for one subject may affect later responses of this same subject. Indeed, it is 
possible that interference extends across the four blocks or sessions for a given subject. 
Write $\mathcal{R}$ for the unobservable array with $N$ rows and $|\Omega|$ columns having entries $r_{bi\mathbf{z}}$. The
unobservable $\mathcal{R}$ describes what would happen under all possible treatment assignments
$\mathbf{z} \in \Omega$, but $\mathcal{R}$ does not change when the actual randomized treatment assignment $\mathbf{Z}$ is selected.
In contrast, the observable responses $r_{bi\mathbf{Z}}$ are one column of $\mathcal{R}$, and which one column that
is does, of course, depend upon the randomized treatment assignment, $\mathbf{Z}$. Fisher's (1935)
sharp null hypothesis $H_0$ of no treatment effect asserts that $r_{bi\mathbf{z}} = r_{bi\mathbf{z}'}$ for all $\mathbf{z}, \mathbf{z}' \in \Omega$
and all $b, i$, so within each row $bi$ of $\mathcal{R}$ all $|\Omega|$ columns have the same value for $r_{bi\mathbf{z}}$.

No interference between units means that $r_{bi\mathbf{z}} = r_{bi\mathbf{z}'}$ whenever $z_{bi} = z'_{bi}$, that is, the
response in block $b$ at trial $i$ depends on the treatment $z_{bi}$ assigned in block $b$ at trial $i$,
but it does not depend on the treatments $z_{b'i'}$ assigned at other trials. As discussed in §1.3,
no interference between units is not plausible in Figure 1, and because of the overlapping
of HRF functions is virtually impossible in Figure 2 if $H_0$ is false.

By a randomized block experiment, we mean that 

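$$\Pr(\mathbf{Z} = \mathbf{z} \mid \mathcal{R}, \mathbf{n}, \mathbf{m}) = \frac{1}{|\Omega|} \quad \text{for each } \mathbf{z} \in \Omega. \tag{1}$$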

In §1.3, the timing of trials and hence also the number $N_b$ of trials was determined by a
random process; then, with probability 1/4 the trial was a "stop trial" and with probability
3/4 the trial was a "go trial;" hence, $\mathbf{n}$ and $\mathbf{m}$ were random variables, but, given
$\mathcal{R}$, $\mathbf{n}$, $\mathbf{m}$, the pattern of stop and go trials was completely
randomized within each block in the sense that (1) was true. Importantly, (1) says treatment
assignments were determined by a truly randomized mechanism that ensured the
unobservable potential responses $\mathcal{R}$ were not predictive of treatment assignment $\mathbf{Z}$; this is, of
course, the essential element of randomized treatment assignment. In a randomized
experiment, any association between treatment assignment $\mathbf{Z}$ and the observed responses, $r_{bi\mathbf{Z}}$,
is due to an effect caused by the treatment expressed in $\mathcal{R}$; because of (1), an association
between $\mathbf{Z}$ and $r_{bi\mathbf{Z}}$ cannot result from biased selection into treated and control groups if
the treatment has no effect, that is, if Fisher's $H_0$ is true.
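
To make the randomization distribution in (1) concrete, the following sketch counts, enumerates and samples the set of possible assignments for a small blocked design. It is an illustration only: the function names are hypothetical, enumeration is feasible only for tiny designs, and the sampling routine assumes that drawing the treated units independently and uniformly within each block is an acceptable stand-in for the experiment's actual randomization device.

```python
import itertools
import math
import random


def omega_size(block_sizes, treated_counts):
    """Number of possible assignments: the product over blocks of C(N_b, n_b)."""
    size = 1
    for N_b, n_b in zip(block_sizes, treated_counts):
        size *= math.comb(N_b, n_b)
    return size


def enumerate_omega(block_sizes, treated_counts):
    """Return an iterator over every assignment, each a tuple of per-block 0/1 tuples."""
    per_block = []
    for N_b, n_b in zip(block_sizes, treated_counts):
        patterns = []
        for treated in itertools.combinations(range(N_b), n_b):
            patterns.append(tuple(1 if i in treated else 0 for i in range(N_b)))
        per_block.append(patterns)
    return itertools.product(*per_block)


def draw_assignment(block_sizes, treated_counts, rng):
    """Draw one assignment uniformly: pick n_b treated units at random in each block."""
    z = []
    for N_b, n_b in zip(block_sizes, treated_counts):
        treated = set(rng.sample(range(N_b), n_b))
        z.append(tuple(1 if i in treated else 0 for i in range(N_b)))
    return tuple(z)


# Toy design: two blocks with (N_b, n_b) = (4, 2) and (3, 1).
toy_assignments = list(enumerate_omega([4, 3], [2, 1]))
one_draw = draw_assignment([4, 3], [2, 1], random.Random(0))
```

Under (1) every element of `toy_assignments` has the same probability, one over `omega_size([4, 3], [2, 1])`. For a single block of eight units with four treated, as in the lady tasting tea, `omega_size([8], [4])` is 70.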

2.2 The uniformity trial 

As mentioned previously, in the absence of interference, it is natural to ask how a unit would 
have responded if that one unit had received the treatment or if that one unit had
received the control and to define the effect of the treatment on this unit as a comparison 
of these two potential responses; see Neyman (1923), Welch (1937) and Rubin (1974) for 
discussion of this standard way of defining treatment effects in randomized experiments. 
This formulation does not work when there is interference between units because a unit 
may be affected by treatments applied to or withheld from other units. Some definition 
of the treatment effect with interference is needed if a randomization test of the null 
hypothesis of no effect is to be inverted to obtain a confidence interval for the magnitude 
of effect. In principle, the treatment effect is characterized by the $N \times |\Omega|$ array $\mathcal{R}$, where
$|\Omega| = \prod_{b=1}^{B} \binom{N_b}{n_b}$; however, that array is mostly not observed, and it is so large and detailed
that it would be beyond human comprehension even if it were observed. We would like
to define the treatment effect as a summary of $\mathcal{R}$, but in such a way that the summary is
intelligible and usable in inference. 

We define the treatment effect with reference to a uniformity trial of the type that, in 
a certain era, was commonly used as an aid to designing experiments; see Cochran (1937). 
For instance, uniformity trials were once used to study the performance of competing 
experimental designs, such as complete randomization or randomized blocks or randomized 
Latin squares. In a uniformity trial, treatment assignment Z is randomized as if an 
actual experiment were about to be performed, but instead Z is ignored and the standard 
treatment is applied in all cases. In its original use, a uniformity trial divides a farm 
into plots, assigns plots to a new treatment or a standard control at random, ignores
the random assignment Z and applies in all cases the standard fertilizer, insecticide, etc., 
and ultimately harvests the crops recording yields in each plot. Essentially, the farmer 
cooperated in setting up an experiment and recording results, but he worked the farm in 
the usual way, harvesting the usual crops for sale. This produced a simulated experiment 
with real crops in which the null hypothesis of no treatment effect is known to be true. For 
instance, by comparing two uniformity trials, perhaps at the same farm, one might discover 
that the estimated standard error is smaller from a uniformity trial designed as a Latin 
square than another uniformity trial designed as randomized blocks. In a certain era, 
statisticians did this, so in our era it is easy to imagine something that was once actually 
done. 

We define the treatment effect with interference with reference to a uniformity trial. 
Stated informally, the effect of a treatment with interference is a comparison of what happened
in the actual experiment with its active treatment to what would have happened in a 
uniformity trial with the same treatment assignment Z but no active treatment. As in an 
era gone by, it is a comparison of two whole experiments, rather than a comparison of a 
treated and a control group. With interference, both treated and control units are affected 
by treatments applied to other units, so a comparison of a treated and a control group is not 
a comparison of a treated and an untreated situation. A comparison of an experiment with 
an active treatment and a uniformity trial is a comparison of a treated and an untreated 
situation. In §1.3, this is a comparison of an experiment with randomized stop and go
trials and an experiment of identical structure with only go trials. Conveniently, using a
few technical tools in §3.3 that were not available in 1937, we can make inferences about
a uniformity trial that was never performed with the aid of a distribution-free pivotal 
quantity. 
Write $\tilde{r}_{bi}$ for the response of unit $i$ in block $b$ in the uniformity trial. There is only
one such $\tilde{r}_{bi}$, not one for each $\mathbf{z} \in \Omega$, because the realized treatment assignment $\mathbf{Z}$ that
was recorded in an office has no way to affect the biological response of unit $i$ in block
$b$. Because the uniformity trial was not actually performed, none of the $\tilde{r}_{bi}$ are observed.
Generally, $\tilde{r}_{bi}$ need not equal any of $r_{bi\mathbf{z}}$, $\mathbf{z} \in \Omega$. If there were no interference between
units, then $\tilde{r}_{bi}$ would equal $r_{bi\mathbf{z}}$ for every $\mathbf{z} \in \Omega$ with $z_{bi} = 0$, because without interference
the response of unit $bi$ depends only on the treatment $z_{bi}$ assigned to $bi$; however, with
interference, it can happen that $\tilde{r}_{bi} \neq r_{bi\mathbf{z}}$ for every $\mathbf{z} \in \Omega$. Write $\tilde{\mathbf{r}} = (\tilde{r}_{11}, \ldots, \tilde{r}_{B,N_B})^T$.

In the presence of interference between units, the magnitude of the treatment effect is 
understood not as a comparison of treated and control groups both of which are affected 
by the treatment, but as a comparison of the actual experiment and the uniformity trial. 

3 Inference with Interference 

3.1 Preliminaries: a nonlinear rank statistic; testing no treatment effect 

Fix an integer $k \geq 2$, with $k \leq \min_{b \in \{1,\ldots,B\}} m_b + 1$. As will be seen, the familiar choice is
$k = 2$, and it yields the Mann-Whitney U-statistic, but there are reasons to prefer a larger
value of $k$ when only some treated units respond to treatment. Ties among responses are
not an issue in the fMRI experiment of §1.3, where blood oxygenation is recorded to many
digits. We assume no ties in the discussion that follows.

The technical material that follows is not difficult but does require a certain amount of 
notation. To simplify, the reader may consider the special case of a single block (B = 1) 
with the parameter k set to k = 2; then, one is considering a single-subject completely 
randomized trial, like the lady tasting tea, using the Mann-Whitney-Wilcoxon statistic,
which happens to be the only linear placement statistic that is also a linear rank statistic 
(Orban and Wolfe 1982). 
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For a specific treatment assignment, z £ fi, consider a subset <S;> Z = {ii, . . . , of k 
units from the same block b with one treated unit, = 1, and k — 1 control units, z^ = 0, 
j = 2, . . . , k. Write /Cf, z for the collection of all ^(^J such subsets Sbz for block b, so 
<Sbz 6 ZC&z if and only if <S bz C {1, . . . , N b } with |<S 6z | = k and 1 = J2ieS b2 z bi- 

The set $\mathcal{S}_{bz}$ compares one treated unit to $k - 1$ control units. Write $v(\mathcal{S}_{bz}) = 1$ if
the treated unit, say $i_1$, in $\mathcal{S}_{bz} = \{i_1, \ldots, i_k\}$ has the largest response under assignment
$z$, that is, if $r_{bi_1z} > \max_{j \in \{i_2,\ldots,i_k\}} r_{bjz}$, and write $v(\mathcal{S}_{bz}) = 0$ otherwise. For $k = 2$, the
set $\mathcal{S}_{bz} = \{i_1, i_2\}$ has one treated unit and one control and $v(\mathcal{S}_{bz}) = 1$ if the treated unit
has a higher response than the control under assignment $z$. Also, let $w_b$ be a weight to
be attached to block $b$ where $w_b$ is a function of $n$ and $m$. In the current paper, $w_b = 1$
for all $b$, but another reasonable definition of $w_b$ will be given in a moment. For this
treatment assignment $z$, the quantity $T_z = \sum_{b=1}^{B} w_b \sum_{\mathcal{S}_{bz} \in \mathcal{K}_{bz}} v(\mathcal{S}_{bz})$ is a weighted count of
the number of sets $\mathcal{S}_{bz}$ such that the treated unit had a higher response than $k - 1$ controls.
With $w_b = 1$, the quantity $T_z$ is a count, and a count is reasonable if the $N_b$ and $n_b$ do not
vary much, as is true in §1.3. If the $n_b$ and $m_b$ varied greatly with $b$, then $w_b$ given by
$1/w_b = B \, n_b \binom{m_b}{k-1}$ makes $T_z$ the unweighted average over the $B$ blocks, $b = 1, \ldots, B$, of the
proportion of sets $\mathcal{S}_{bz} \in \mathcal{K}_{bz}$ in which the treated unit had a higher response than $k - 1$
controls, $v(\mathcal{S}_{bz}) = 1$. For most $z$, the quantity $T_z$ depends upon parts of $\mathcal{R}$ that are not
observed, so $T_z$ cannot be computed from the observed data.
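The definition of $T_z$ can be made concrete with a direct enumeration of the sets $\mathcal{S}_{bz}$. The following is a minimal sketch, not the authors' code: the function name, the data layout (one `(responses, treated)` pair per block) and the toy block are assumptions made for illustration, and the enumeration is only practical for small blocks.

```python
from itertools import combinations

def t_statistic_bruteforce(blocks, k, weights=None):
    """Weighted count of sets in which the treated unit beats k-1 controls.

    blocks: list of (responses, treated) pairs, one per block, where
            treated[i] is 1 for a treated unit and 0 for a control.
    """
    total = 0.0
    for b, (responses, treated) in enumerate(blocks):
        w = 1.0 if weights is None else weights[b]
        treated_r = [r for r, z in zip(responses, treated) if z == 1]
        control_r = [r for r, z in zip(responses, treated) if z == 0]
        count = 0
        for rt in treated_r:
            # each subset of k-1 controls forms one set S with this treated unit
            for subset in combinations(control_r, k - 1):
                if rt > max(subset):
                    count += 1
        total += w * count
    return total

# Hypothetical single block: 2 treated units, 3 controls, no ties.
block = ([5.0, 1.0, 3.0, 2.0, 4.0], [1, 0, 1, 0, 0])
print(t_statistic_bruteforce([block], k=2))  # 5.0, the Mann-Whitney U count
print(t_statistic_bruteforce([block], k=3))  # 4.0
```

With $k = 2$ each set is a treated-control pair, so the count is the Mann-Whitney U-statistic; with $k = 3$ the treated response 3.0 beats only the control pair (1.0, 2.0), which is why the count drops.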

If the randomized treatment assignment $Z$ replaces the specific treatment assignment
$z$, then needed parts of $\mathcal{R}$ are observed, and $T_Z$ is a statistic that can be computed from
the data. Indeed, if $B = 1$, $k = 2$, and $w_1 = 1$, then $T_Z$ is the Mann-Whitney U-statistic
and is linearly related to Wilcoxon's rank sum statistic. More generally, for $k = 2$ and
$B \geq 2$, $T_Z$ is a weighted sum of $B$ Mann-Whitney statistics; see Lehmann (1975, §3.3) and
Puri (1965) who discuss weights intended to increase power against shift alternatives in
the absence of interference.

For $k \geq 2$, if $n_b = 1$ and $N_b = N/B$ does not vary with $b$, then $T_Z$ is the statistic
discussed in Rosenbaum (2007b). Taking $k > 2$ tends to increase power when only a
subset of treated units respond to treatment, as seems likely here for reasons discussed
in §1.3. Indeed, with $k > 2$, the ranks are scored in a manner that closely approximates
Conover and Salsburg's (1988) locally most powerful ranks for an alternative in which
only a fraction of treated units respond, and the scores are identical to those proposed by
Stephenson (1981).

In these special cases of the two previous paragraphs, $T_Z$ is a stratified linear rank
statistic. In general, $T_Z$ is a function of the ranks, but not a linear function; however, it is
a sum of $B$ linear functions of the placements within blocks in the sense of Orban and Wolfe
(1982). For $B = 1$, the statistic with $k > 2$ has been discussed by Deshpande and Kochar
(1980) and Stephenson and Ghosh (1985) as an instance of Hoeffding's (1948) U-statistics
under independent sampling of two distributions. Because interference precludes
independent observations, inferences must be based on the random assignment of treatments, and
for this the combinatorial development in Orban and Wolfe (1982) is particularly helpful.

Orban and Wolfe (1982) define the placement $m_b U_{bj}$ of the $j$th treated unit in block
$b$ to be the number of controls in block $b$ who have a response less than or equal to the
response of this treated unit. A linear placement statistic for one block $b$ is then of the form
$\sum_{j=1}^{n_b} \phi_{n_b,m_b}(U_{bj})$ for some function $\phi_{n_b,m_b}(\cdot)$. If there are no ties among responses within
blocks, then taking $\phi_{n_b,m_b}(u) = w_b \binom{m_b u}{k-1}$ expresses $T_Z = \sum_{b=1}^{B} w_b \sum_{\mathcal{S}_{bZ} \in \mathcal{K}_{bZ}} v(\mathcal{S}_{bZ})$ as a
sum of linear functions of placements of the treated units, $T_Z = \sum_{b=1}^{B} \sum_{j=1}^{n_b} \phi_{n_b,m_b}(U_{bj})$,
where $\binom{\ell}{k-1}$ is defined to equal zero for $\ell < k - 1$.
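The placement form avoids enumerating subsets: a treated unit with $\ell$ controls below it is the largest member of exactly $\binom{\ell}{k-1}$ sets. A short sketch under the same assumed data layout as before (a `(responses, treated)` pair per block; the function names are hypothetical):

```python
from math import comb

def placements(responses, treated):
    """m_b * U_bj for each treated unit: number of controls at or below it."""
    controls = [r for r, z in zip(responses, treated) if z == 0]
    return [sum(c <= r for c in controls)
            for r, z in zip(responses, treated) if z == 1]

def t_statistic_placements(blocks, k, weights=None):
    """T as a sum over blocks of w_b * C(placement, k-1) over treated units."""
    total = 0.0
    for b, (responses, treated) in enumerate(blocks):
        w = 1.0 if weights is None else weights[b]
        # math.comb(l, k-1) returns 0 when l < k-1, matching the convention above
        total += w * sum(comb(l, k - 1) for l in placements(responses, treated))
    return total

# Hypothetical block: treated responses 5.0 and 3.0; controls 1.0, 2.0, 4.0.
block = ([5.0, 1.0, 3.0, 2.0, 4.0], [1, 0, 1, 0, 0])
print(placements(*block))                     # [3, 2]
print(t_statistic_placements([block], k=3))   # 4.0 = C(3,2) + C(2,2)
```

The cost is linear in the number of treated units once placements are known, which matters for a study with 232 blocks and large $\binom{m_b}{k-1}$.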

Consider testing Fisher's null hypothesis $H_0$ of no effect which asserts that $r_{biz} = r_{biz'}$
for all $z, z' \in \Omega$, and all $b$, $i$. If $H_0$ were true, then the observed response $r_{biZ}$ is the
same no matter what $Z \in \Omega$ is randomly selected, so $\mathcal{R}$ is known and the distribution
$\Pr(T_Z \geq t \mid \mathcal{R}, n, m)$ is determined by the known fixed $\mathcal{R}$ and the randomized
treatment assignment (1). Indeed, in part because no effect entails no interference in effects,
testing no effect $H_0$ is a straightforward application of randomization inference. Orban
and Wolfe (1982, §2) determine the null distribution of their linear placement statistic
$\sum_{j=1}^{n_b} \phi_{n_b,m_b}(U_{bj})$ under independent sampling from a continuous distribution; however,
their argument is entirely combinatorial, and it is easily seen that if responses within blocks
are not tied then their argument and results give the exact null randomization distribution
of $\sum_{j=1}^{n_b} \phi_{n_b,m_b}(U_{bj})$. Moreover, given $\mathcal{R}$, $n$, $m$, under $H_0$ and (1), $T_Z$ is the sum of $B$
conditionally independent terms each with the known null distribution in Orban and Wolfe
(1982, §2). Importantly, in the absence of ties, this null distribution of $T_Z$ depends upon
$n$, $m$, but not on $\mathcal{R}$.
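The claim that the null distribution depends on the block sizes but not on the responses can be checked by brute force in a tiny block. The sketch below assumes, as an illustration, that the randomization gives equal probability to every choice of $n_b$ treated units within the block and uses $w_b = 1$; the function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import comb

def null_distribution(responses, n_treated, k):
    """Exact randomization distribution of T for one block under no effect."""
    N = len(responses)
    counts = Counter()
    for treated_idx in combinations(range(N), n_treated):
        tset = set(treated_idx)
        controls = [responses[i] for i in range(N) if i not in tset]
        # placement form of the statistic for this assignment
        t = sum(comb(sum(c < responses[i] for c in controls), k - 1)
                for i in treated_idx)
        counts[t] += 1
    total = comb(N, n_treated)
    return {t: Fraction(c, total) for t, c in sorted(counts.items())}

# Two unrelated, untied response vectors give the same null distribution.
dist_a = null_distribution([5.0, 1.0, 3.0, 2.0, 4.0], n_treated=2, k=2)
dist_b = null_distribution([0.1, 9.0, -3.0, 7.5, 2.2], n_treated=2, k=2)
print(dist_a == dist_b)  # True
```

For $k = 2$ this reproduces the familiar null distribution of the Mann-Whitney statistic with 2 treated units and 3 controls.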

3.2 The distribution of the test statistic in the uniformity trial 

In the uniformity trial of §2.2, the null hypothesis of no effect on $\tilde{r}_{bi}$ is known to be
true because, following a concealed randomization $Z$, no treatment was applied. Let $\tilde{T}_z$
be the value of the statistic of §3.1 computed from the $\tilde{r}_{bi}$ in the uniformity trial when
$Z = z \in \Omega$, with value $\tilde{T}_Z$ under the realized random assignment $Z$. Specifically, write
$\tilde{v}(\mathcal{S}_{bz}) = 1$ if the treated unit, say $i_1$, in $\mathcal{S}_{bz} = \{i_1, \ldots, i_k\}$ has the largest response under
assignment $z$, that is, if $\tilde{r}_{bi_1} > \max_{j \in \{i_2,\ldots,i_k\}} \tilde{r}_{bj}$, and write $\tilde{v}(\mathcal{S}_{bz}) = 0$ otherwise, so
that $\tilde{T}_z = \sum_{b=1}^{B} w_b \sum_{\mathcal{S}_{bz} \in \mathcal{K}_{bz}} \tilde{v}(\mathcal{S}_{bz})$. Even though $\tilde{r}_{bi}$ is not affected by the treatment
assignment $Z$, the statistic $\tilde{T}_Z$ is generally a nondegenerate random variable because the
value of the statistic depends jointly on the responses of units, $\tilde{r}_{bi}$, which do not fluctuate,
and on the treatments they receive, $Z_{bi}$, which are random.

In point of fact, neither $\tilde{T}_Z$ nor $\tilde{T}_z$ can be computed from observed data, because the
uniformity trial was not performed and none of the $\tilde{r}_{bi}$ are observed. Nonetheless, in
the absence of ties, the distribution of $\tilde{T}_Z$ is known, because the null distribution in §3.1
depends upon $n$ and $m$ but not on $\mathcal{R}$; specifically, it is the convolution of $B$ random
variables whose exact distributions are given by Orban and Wolfe (1982, Theorem 2.1, and
expressions (2.1) and (2.2)), with expectation and variance

$$E(\tilde{T}_Z) = \sum_{b=1}^{B} \frac{n_b}{m_b + 1} \sum_{\ell=0}^{m_b} \phi_{n_b,m_b}(\ell/m_b),$$

$$\mathrm{var}(\tilde{T}_Z) = \sum_{b=1}^{B} \frac{n_b (m_b + n_b + 1)}{(m_b + 1)(m_b + 2)} \left[ \sum_{\ell=0}^{m_b} \phi_{n_b,m_b}(\ell/m_b)^2 - \frac{1}{m_b + 1} \left\{ \sum_{\ell=0}^{m_b} \phi_{n_b,m_b}(\ell/m_b) \right\}^2 \right];$$

moreover, for reasonable choices of weights, $w_b$, as $B \to \infty$,

$$\left\{ \tilde{T}_Z - E(\tilde{T}_Z) \right\} / \sqrt{\mathrm{var}(\tilde{T}_Z)} \xrightarrow{D} \Phi(\cdot) \qquad (2)$$

where $\Phi(\cdot)$ is the standard Normal cumulative distribution. To emphasize, because the
null hypothesis of no effect is known to be true in the uniformity experiment, and because
the null distribution of $\tilde{T}_Z$ depends upon $n$ and $m$ but not on the $\tilde{r}_{bi}$'s, it follows that we
know the distribution of $\tilde{T}_Z$ in the uniformity trial even though we did not perform the
uniformity trial and even if the treatment did have an effect with interference in the actual
randomized experiment. This fact turns out to be useful with the aid of the concept of
attributable effects (Rosenbaum 2001, 2007a, 2007b).
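The null moments can be computed in closed form from the block sizes alone. The sketch below is an illustration rather than the authors' implementation: it assumes that, without ties, each placement is uniform on $\{0, \ldots, m_b\}$ under the null, that the score for placement $\ell$ is $w_b \binom{\ell}{k-1}$, and that the variance of a linear placement statistic takes the finite-population form coded here. Function names are hypothetical.

```python
from fractions import Fraction
from math import comb, sqrt

def block_null_moments(n_b, m_b, k, w_b=1):
    """Null mean and variance of one block's contribution (no ties)."""
    phi = [Fraction(w_b) * comb(l, k - 1) for l in range(m_b + 1)]
    s1 = sum(phi)
    s2 = sum(p * p for p in phi)
    mean = Fraction(n_b, m_b + 1) * s1
    var = (Fraction(n_b * (m_b + n_b + 1), (m_b + 1) * (m_b + 2))
           * (s2 - s1 * s1 / (m_b + 1)))
    return mean, var

def null_deviate(t_obs, sizes, k):
    """Standardized deviate; sizes is a list of (n_b, m_b), one per block."""
    moments = [block_null_moments(n, m, k) for n, m in sizes]
    mean = sum(mu for mu, _ in moments)   # blocks are independent,
    var = sum(v for _, v in moments)      # so means and variances add
    return (t_obs - float(mean)) / sqrt(float(var))

# k = 2 recovers the Mann-Whitney moments n*m/2 and n*m*(n+m+1)/12.
print(block_null_moments(2, 3, 2))  # (Fraction(3, 1), Fraction(3, 1))
print(block_null_moments(2, 3, 3))  # (Fraction(2, 1), Fraction(18, 5))
```

The $k = 3$ values agree with a hand enumeration of the ten possible assignments of 2 treated units among 5, where the statistic has mean 2 and variance 3.6. The mean also simplifies to $n_b \binom{m_b}{k-1}/k$, the chance level of $1/k$ per set.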

3.3 Attributable effects 

Consider a specific treatment assignment $z \in \Omega$ and the specific comparison $\mathcal{S}_{bz} = \{i_1, \ldots, i_k\} \in
\mathcal{K}_{bz}$ of one treated unit, say $i_1$ with $z_{bi_1} = 1$, and $k - 1$ control units, $i_j$ with $z_{bi_j} = 0$ for
$j = 2, \ldots, k$. If $r_{bi_1z} > \max_{j \in \{i_2,\ldots,i_k\}} r_{bjz}$ then $i_1$ had the highest response in this
comparison, contributing a 1 rather than a 0 to $T_z$; however, this might or might not be an effect
caused by the treatment, because even under the null hypothesis of no effect $H_0$, one of
the $k$ units will have the highest response among the $k$ units. If $r_{bi_1z} > \max_{j \in \{i_2,\ldots,i_k\}} r_{bjz}$
but $\tilde{r}_{bi_1} < \max_{j \in \{i_2,\ldots,i_k\}} \tilde{r}_{bj}$ then treatment assignment $z$ in the actual experiment does
cause unit $i_1$ in block $b$ to have a higher response than units $\{i_2, \ldots, i_k\}$ in block $b$ in the
sense that unit $i_1$ in block $b$ would not have had the highest response in this comparison
in the uniformity trial of §2.2 in which no unit was treated. In §1.3 this would mean that
in block $b$, stop trial $i_1$ caused activity in the STN region to exceed the level in go trials
$i_2, \ldots, i_k$ in the sense that the activity was higher in the actual experiment and would
not have been higher in the uniformity trial. Conversely, if $r_{bi_1z} < \max_{j \in \{i_2,\ldots,i_k\}} r_{bjz}$ but
$\tilde{r}_{bi_1} > \max_{j \in \{i_2,\ldots,i_k\}} \tilde{r}_{bj}$ then treatment assignment $z$ in the actual experiment prevented
treated unit $i_1$ from having the highest response in $\mathcal{S}_{bz}$, in the sense that $i_1$ would have
had the highest response in the uniformity trial but did not have the highest response in
the actual experiment. The third possibility is that treatment assignment $z$ does not alter
whether or not $i_1$ has the highest response in $\mathcal{S}_{bz}$. Concisely, these three situations are:
(i) $v(\mathcal{S}_{bz}) = 1$ and $\tilde{v}(\mathcal{S}_{bz}) = 0$, (ii) $v(\mathcal{S}_{bz}) = 0$ and $\tilde{v}(\mathcal{S}_{bz}) = 1$, and (iii) $v(\mathcal{S}_{bz}) = \tilde{v}(\mathcal{S}_{bz})$.
For treatment assignment $z \in \Omega$, the attributable effect

$$A_z = T_z - \tilde{T}_z = \sum_{b=1}^{B} w_b \sum_{\mathcal{S}_{bz} \in \mathcal{K}_{bz}} \left\{ v(\mathcal{S}_{bz}) - \tilde{v}(\mathcal{S}_{bz}) \right\}$$

is the net increase in the number of times (weighted by $w_b$) that a treated response in
the actual experiment exceeded $k - 1$ control responses because of effects caused by using
treatment assignment $z$. So $A_z$ is a real-valued function of $z$, the uniformity-trial
responses $\tilde{r}_{bi}$, and $\mathcal{R}$. In contrast,
$A_Z$ is the attributable effect for the $Z$ randomly chosen according to (1), so $A_Z$ is the
difference between an observed statistic, $T_Z$, that describes the actual experiment, and an
unobservable random variable $\tilde{T}_Z$ that describes the uniformity trial in §2.2; however, the
distribution of $\tilde{T}_Z$ is known, as discussed in §3.2. In brief, $A_Z$ is an unknown random
quantity which provides a reasonable measure of the effects of the treatment despite the
presence of interference between units. More precisely, $A_Z$ compares the aggregate effects
of the treatment in the presence of interference to the pattern that would be exhibited in
the uniformity trial in which no one is treated. If Fisher's null hypothesis of no effect $H_0$
is true, then $E(A_Z) = 0$. For discussion of attributable effects in randomized experiments
without interference, see Rosenbaum (2001).
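A toy calculation may help fix ideas about the attributable effect. It is purely a thought experiment: the uniformity-trial responses are never observed, so the values below are invented for illustration, as are the variable names. Note that one control response also changes between the two trials, which is what interference permits.

```python
from math import comb

def t_stat(responses, treated, k):
    """Single-block statistic with w_b = 1, using the placement form."""
    controls = [r for r, z in zip(responses, treated) if z == 0]
    return sum(comb(sum(c < r for c in controls), k - 1)
               for r, z in zip(responses, treated) if z == 1)

treated = [1, 0, 1, 0, 0]                # one fixed assignment z
r_tilde = [2.5, 1.0, 3.0, 2.0, 4.0]      # imagined uniformity trial
r_actual = [5.0, 1.0, 3.0, 2.0, 4.5]     # actual experiment under z

k = 2
T = t_stat(r_actual, treated, k)         # 5
T_tilde = t_stat(r_tilde, treated, k)    # 4
A = T - T_tilde                          # 1 comparison attributable to treatment
print(T, T_tilde, A)
```

Here the first treated unit beats the third control in the actual experiment but would not have in the uniformity trial, so exactly one comparison is attributed to the treatment.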



Let $t_\alpha$ be the smallest value such that $\Pr(\tilde{T}_Z \leq t_\alpha \mid \mathcal{R}, n, m) \geq 1 - \alpha$. From (2), for
large $B$, we may approximate $t_\alpha$ as

$$t_\alpha \approx E(\tilde{T}_Z) + \Phi^{-1}(1 - \alpha) \sqrt{\mathrm{var}(\tilde{T}_Z)}.$$

The following fact parallels a result in Rosenbaum (2007a) for a different family of statistics.
In particular, Proposition 1 yields a one-sided $1 - \alpha$ confidence interval for the unobserved
random variable $A_Z$ in terms of the observed random variable $T_Z$ and the known quantity
$t_\alpha$. See Weiss (1955) for general discussion of confidence intervals for unobservable random
variables.

Proposition 1 In a randomized experiment with interference in which (1) holds and there
are no ties,

$$\Pr(A_Z \geq T_Z - t_\alpha \mid \mathcal{R}, n, m) \geq 1 - \alpha.$$

The proof of Proposition 1 is immediate:

$$\Pr(A_Z \geq T_Z - t_\alpha \mid \mathcal{R}, n, m) = \Pr(T_Z - \tilde{T}_Z \geq T_Z - t_\alpha \mid \mathcal{R}, n, m) = \Pr(\tilde{T}_Z \leq t_\alpha \mid \mathcal{R}, n, m) \geq 1 - \alpha.$$

The attributable effect $A_Z$ depends upon the sample sizes, $n$ and $m$, and the choice of $k$.
Dividing $A_Z$ by $E(\tilde{T}_Z)$ can aid interpretation. Then $100 \times A_Z / E(\tilde{T}_Z)$ is the (weighted)
percent increase above chance in the number of times a treated unit had a higher response
than $k - 1$ controls due to effects caused by the treatment, and with confidence $1 - \alpha$, the
unobserved $100 \times A_Z / E(\tilde{T}_Z)$ is at least $100 \times (T_Z - t_\alpha) / E(\tilde{T}_Z)$.

We are suggesting that the unobservable random variable $A_Z / E(\tilde{T}_Z)$ is a useful
measure of the magnitude of the treatment effect when interference may be present; however,
it is a new measure, and its magnitude is unfamiliar. To build some intuition about
magnitudes of $A_Z / E(\tilde{T}_Z)$, consider its behavior in a familiar context, namely a single block,
$B = 1$, independent observations without interference and a treatment effect that is an
additive shift, $\delta$. In this case, as $n_1 \to \infty$ and $m_1 \to \infty$, the quantity $T_Z / \{ n_1 \binom{m_1}{k-1} \}$
converges in probability to the probability that a treated response exceeds $k - 1$ control
responses and $A_Z / E(\tilde{T}_Z)$ converges in probability to the percent increase in this probability
above the chance level of $1/k$. Table 1 evaluates these limits for the standard Normal
distribution and the t-distribution with 2 degrees of freedom. For instance, with a shift
$\delta$ in a Normal that equals a full standard deviation, $\delta = 1$, the probability that a treated
response exceeds nine control responses is 0.341, which is 241% above the chance level of 0.1.
The quantity $A_Z / E(\tilde{T}_Z)$ has the advantage that it continues to be meaningful
with interference where shift models are inapplicable.

Proposition 1 refers to an analysis of responses, but it is possible in a randomized
experiment to use the same approach with residuals from a robust covariance adjustment which
controls for measured disturbing covariates such as head motion. See Rosenbaum (2002)
for general discussion of randomization inference for covariance adjustment in randomized
experiments, and see Rosenbaum (2007a, §6) for its application with interference. This
procedure is illustrated in §4 in Table 4.
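In practice the bound in Proposition 1 needs only the observed statistic and the null mean and variance. The following sketch assumes the large-$B$ Normal approximation to $t_\alpha$ (mean plus the upper Normal quantile times the standard deviation); the function name and the numbers in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def fractional_effect_bound(t_obs, e_null, var_null, alpha=0.05):
    """Point estimate and one-sided lower bound for A_Z / E(T~_Z).

    t_alpha is approximated by E + z_{1-alpha} * sd, as suggested by the
    Normal limit; an exact t_alpha could be substituted for small B.
    """
    t_alpha = e_null + NormalDist().inv_cdf(1 - alpha) * sqrt(var_null)
    point = (t_obs - e_null) / e_null
    lower = (t_obs - t_alpha) / e_null
    return point, lower

# Hypothetical values: observed T = 130, null mean 100, null variance 25.
point, lower = fractional_effect_bound(130.0, 100.0, 25.0)
print(round(point, 3), round(lower, 3))  # 0.3 0.218
```

The point estimate says the count is 30% above chance; with 95% confidence the attributable part is at least about 22% above chance.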
Table 1: In the absence of interference and dependence, the upper table gives the
probability that one treated response is higher than $k - 1$ independent control responses when
the treatment effect is an additive shift $\delta$ and the errors are independently drawn from
either a standard Normal distribution or the t-distribution with 2 degrees of freedom. The
lower table gives the percentage increase in this probability above chance; for example,
14% = (0.57 - 0.50)/0.50.

Probability a treated response is higher than k - 1 controls

                    Normal                      t with 2 df
$\delta$     k = 2   k = 5   k = 10      k = 2   k = 5   k = 10
0            0.50    0.20    0.10        0.50    0.20    0.10
0.25         0.57    0.26    0.14        0.55    0.24    0.12
0.5          0.64    0.33    0.20        0.60    0.29    0.15
1            0.76    0.49    0.34        0.69    0.39    0.22

Percentage increase above chance

                    Normal                      t with 2 df
$\delta$     k = 2   k = 5   k = 10      k = 2   k = 5   k = 10
0            0       0       0           0       0       0
0.25         14      31      44          10      21      22
0.5          28      67      99          20      45      50
1            52      147     241         39      97      120
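The Normal columns of Table 1 can be approximated numerically, since with a shift $\delta$ the probability that a treated response exceeds $k - 1$ independent controls is $\int \Phi(x)^{k-1} \varphi(x - \delta)\,dx$. The sketch below uses a plain trapezoidal rule; the function names, integration limits and step count are choices made for illustration, not taken from the paper.

```python
from math import erf, exp, pi, sqrt

def normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def normal_pdf(x):
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def prob_treated_highest(delta, k, lo=-12.0, hi=12.0, steps=24000):
    """P(treated > max of k-1 controls), Normal errors, additive shift delta."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        f = normal_cdf(x) ** (k - 1) * normal_pdf(x - delta)
        total += f if 0 < i < steps else 0.5 * f   # trapezoid end weights
    return total * h

def percent_above_chance(delta, k):
    p = prob_treated_highest(delta, k)
    return 100.0 * (p - 1.0 / k) * k

print(round(prob_treated_highest(1.0, 2), 2))   # 0.76
print(round(prob_treated_highest(0.0, 10), 2))  # 0.1
```

For $k = 2$ the integral has the closed form $\Phi(\delta/\sqrt{2})$, which gives a quick check on the numerical rule.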
4 To what extent do stop trials activate the subthalamic nucleus?

Is activity in the subthalamic nucleus (STN) elevated following stop trials? Figures 1 and
2 depict STN activity for one subject in one session, but there are 58 subjects, each with
4 sessions, making 58 x 4 = 232 blocks, with a total of N = 22,440 randomized go-or-stop
trials.

Table 2 performs the analysis in §3 three times, for k = 2, 5 and 10. Recall that
for k = 2, the statistic $T_Z$ is the sum of 232 Mann-Whitney-Wilcoxon statistics. The
deviates for testing the null hypothesis of no effect $H_0$ are extremely large, particularly
for the filtered data with k = 5 or k = 10. In the uniformity trial, we expect that when
comparing a treated unit to nine controls, one time in ten the treated unit will have the
highest response. For filtered STN, k = 10, the point estimate of $A_Z / E(\tilde{T}_Z)$ suggests
a 53.0% increase above this chance expectation due to effects caused by the treatment,
but we are 95% confident of only a 46.4% increase. Again, in the presence of arbitrary
interference between units, these are correct statements about the relationship between the
actual trial, with its unobserved attributable effect $A_Z$, and the uniformity trial that was
not actually performed.

In addition to the subthalamic nucleus, other regions of the brain are suspected to
be involved in motor response inhibition, including the right inferior frontal cortex (or
rIFC, see Forstmann, et al. 2008) and the presupplementary motor area (or preSMA, see
Simmonds, Pekar and Mostofsky 2008). In analyses parallel to Table 2, we found smaller
but significantly elevated activity in both the rIFC and preSMA. Using filtered data for
rIFC with k = 5, we obtained a P-value testing no effect of 0.0030, a point estimate for
$A_Z / E(\tilde{T}_Z)$ of 0.059 and a 95% confidence interval of $A_Z / E(\tilde{T}_Z) \geq 0.024$. Using filtered
data for preSMA with k = 5, we obtained a P-value testing no effect of 0.000053, a point
estimate for $A_Z / E(\tilde{T}_Z)$ of 0.084 and a 95% confidence interval of $A_Z / E(\tilde{T}_Z) \geq 0.048$.







Table 2: Randomization test of no treatment effect $H_0$ and randomization-based confidence
interval for the attributable effect $A_Z$ in the presence of interference between units.

                  Test of No Effect                                Fractional Increase $A_Z / E(\tilde{T}_Z)$
                  Deviate Testing $H_0$                            Point Estimate                          95% CI
                  $(T_Z - E(\tilde{T}_Z)) / \sqrt{\mathrm{var}(\tilde{T}_Z)}$    $(T_Z - E(\tilde{T}_Z)) / E(\tilde{T}_Z)$    $(T_Z - t_\alpha) / E(\tilde{T}_Z)$

Unfiltered STN
k = 2             8.427                                            0.076                                   0.061
k = 5             8.874                                            0.192                                   0.156
k = 10            8.161                                            0.327                                   0.261

Filtered STN
k = 2             11.000                                           0.099                                   0.084
k = 5             13.630                                           0.295                                   0.259
k = 10            13.219                                           0.530                                   0.464
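The three columns of Table 2 are tied together by a short identity, assuming the large-sample approximation $t_\alpha \approx E(\tilde{T}_Z) + 1.645 \sqrt{\mathrm{var}(\tilde{T}_Z)}$ was used for the 95% bound (an assumption, though the tabled values are consistent with it). Writing the ratio of standard deviation to mean as point estimate over deviate gives, for filtered STN with $k = 10$:

```latex
\begin{align*}
\frac{T_Z - t_\alpha}{E(\tilde{T}_Z)}
  &\approx \frac{T_Z - E(\tilde{T}_Z)}{E(\tilde{T}_Z)}
     - 1.645\,\frac{\sqrt{\mathrm{var}(\tilde{T}_Z)}}{E(\tilde{T}_Z)}
   = \text{point} - 1.645 \times \frac{\text{point}}{\text{deviate}} \\
  &= 0.530 - 1.645 \times \frac{0.530}{13.219}
   \approx 0.530 - 0.066 = 0.464 .
\end{align*}
```

The same arithmetic applied to unfiltered STN with $k = 5$ gives $0.192 - 1.645 \times 0.192/8.874 \approx 0.156$, again matching the table.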



Although the point estimates of 5.9% above chance for rIFC and 8.4% above chance for 
preSMA are significantly larger than zero, they are substantially smaller than the point 
estimate of 29.5% above chance for filtered STN with k = 5 in Table 2. 

For $k > 2$, the statistic $T_Z$ and unmeasurable attributable effect $A_Z$ handle the treated
and control groups in different ways: one treated unit is compared to $k - 1$ controls. If
one expected successful stop trials to suppress rather than elevate activity, one needs to
apply $T_Z$ to $-r_{biZ}$ rather than to $r_{biZ}$. For instance, we might expect reduced activity in
the primary motor cortex (M1) during stop trials, because motor activity is not requested
in a stop trial. Applying $T_Z$ to $-r_{biZ}$ for filtered data from M1 with $k = 5$, we obtain a
P-value testing no effect of 0.000011, a point estimate for $A_Z / E(\tilde{T}_Z)$ of 0.092 and a 95%
confidence interval of $A_Z / E(\tilde{T}_Z) \geq 0.056$, consistent with reduced activity in M1 in stop
trials.

The inferences just described are appropriate in the presence of interference of arbitrary
form. But is there interference? Here, we look at one very simple possible form for
interference, namely interference from the immediately previous trial. Recall that trials
are randomly go or stop trials, where go trials occur with probability 0.75 and stop trials with
probability 0.25. Aside from the first trial in a session, the remaining trials may be classified
into four groups based on the current and previous trial as go-go with probability $0.75^2 =
0.5625$, stop-go with probability 0.25 x 0.75 = 0.1875, go-stop with probability 0.75 x 0.25 =
0.1875, and stop-stop with probability $0.25^2 = 0.0625$. If there is no interference, then the
treatment at the current trial may have an effect, but not the treatment at the previous
trial, so go-go should have the same effect as stop-go, and go-stop should have the same
effect as stop-stop. Table 3 compares the two common cases, go-go to stop-go trials,
ignoring other cases, using the same methods as in Table 2, reporting only results for
filtered data. In Table 3, a difference indicates a very specific form of interference, namely
a lingering effect from a previous stop trial on a current go trial. There is clearly evidence
in Table 3 of a lingering effect of a previous stop trial, but the magnitudes of effect are much
smaller than in Table 2 for the effect of the treatment in the current trial. Importantly,
the inferences in Table 3 are appropriate for comparing go-go and stop-go trials even if
other forms of interference are also present.

Table 3: Testing for a simple form of interference: comparison of go-go trials and stop-go
trials for STN.

              Test of No Lingering Effect     Fractional Increase $A_Z / E(\tilde{T}_Z)$
Filtered      Deviate                         Point Estimate      95% CI
k = 2         2.906                           0.031               0.013
k = 5         3.151                           0.085               0.040
k = 10        3.095                           0.201               0.094
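The classification of trials by the previous and current trial type is simple bookkeeping. A small sketch, with labels written as previous-current to match the usage above; the function names and the example sequence are hypothetical:

```python
P_STOP = 0.25  # probability of a stop trial, as stated in the text

def pair_probabilities(p_stop=P_STOP):
    """Probabilities of the four previous-current combinations."""
    p_go = 1.0 - p_stop
    return {
        "go-go": p_go * p_go,
        "stop-go": p_stop * p_go,
        "go-stop": p_go * p_stop,
        "stop-stop": p_stop * p_stop,
    }

def classify_pairs(trials):
    """Label every trial after the first as 'previous-current'."""
    return [f"{prev}-{cur}" for prev, cur in zip(trials, trials[1:])]

print(pair_probabilities())
print(classify_pairs(["go", "stop", "go", "go"]))
# ['go-stop', 'stop-go', 'go-go']
```

The comparison in Table 3 would then keep only the trials labeled go-go and stop-go, treating a previous stop as the "treatment" for a current go trial.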

Head movements during the experiment may distort fMRI readings. As discussed
and illustrated in Rosenbaum (2007a, §6), instead of studying the attributable effect for
the responses themselves, the method in Rosenbaum (2002) may be used as the basis for
randomization inference about the attributable effect on residuals from a robust covariance
adjustment. Table 4 applies the method of Table 2 to residuals of STN levels after
adjustment for six covariates describing translation and rotation of the head as estimated
from three-dimensional images of each session, the residuals being obtained using the
default settings of the R function rlm, which implements Huber's M-estimation. Table 4
is generally similar to Table 2, so covariance adjustment for head motion did not greatly
alter the results.

Table 4: Comparison of STN activity with robust covariance adjustment for head
movement.

              Test of No Effect $H_0$     Fractional Increase $A_Z / E(\tilde{T}_Z)$
Filtered      Deviate                     Point Estimate      95% CI
k = 2         10.974                      0.099               0.084
k = 5         12.528                      0.271               0.235
k = 10        11.103                      0.445               0.379

5 A Simulation of the Size and Power of Competing Tests in the Presence of 
Interference Between Units 

5.1 Description of the simulation 

Tables 5 and 6 report a simulation study of power with and without interference between
units. Both tables refer to a completely randomized experiment; that is, there is a single
block, $B = 1$. In Table 5, there are $N = 250$ trials, whereas in Table 6 there are $N = 1000$
trials. Each trial is randomly assigned to be a treated trial or a control trial with probability
1/2. As in the actual experiment in §1.3, only some treated trials elicit the intended cognitive
activity and brain response. In Table 5, $\lambda = 50\%$ of treated trials are successful, whereas
in Table 6, $\lambda = 10\%$ of treated trials are successful. A control trial yields a response
drawn from a distribution $F(\cdot)$, and in Tables 5 and 6 this distribution $F(\cdot)$ is either the
standard Normal distribution or the t-distribution on 2 degrees of freedom. In the absence
of interference, a successful treated trial yields a response from $F^\nu(\cdot)$ and an unsuccessful
trial yields a response from $F(\cdot)$, so an unsuccessful treated trial looks like a control trial,
but a successful treated trial looks like the maximum of $\nu$ independent control trials;
see Lehmann (1953), Salsburg (1986) and Conover and Salsburg (1988) for discussion and
history of this mixture model. Formally, in the absence of interference, the Salsburg model
yields control responses from $F(\cdot)$ and treated responses from $(1 - \lambda) F(\cdot) + \lambda F^\nu(\cdot)$, where
Lehmann had considered $\nu = 2$, and Conover and Salsburg had determined the locally most
powerful ranks as $\lambda \to 0$, which are essentially Wilcoxon's ranks for $\nu = 2$. In this mixture
model, successful treated trials are from $F^\nu(\cdot)$ and unsuccessful treated trials are from
$F(\cdot)$, but trials are not labeled as successful or unsuccessful. We introduce interference
into this model by assuming that a successful trial samples from $F^\nu(\cdot)$ rather than $F(\cdot)$
only if certain additional conditions hold defined in terms of treatments assigned to the
previous few trials.

In Table 5, $\nu = 10$, but in Table 6, $\nu = 20$; that is, a larger $\nu$ in Table 6 offsets a smaller
$\lambda$ so that the power remains in an interesting range. The maximum of $\nu = 10$ independent
observations from a Normal distribution will often be smaller and more stable than the
maximum of $\nu = 10$ observations from a t-distribution with 2 degrees of freedom, and this
may affect different tests in different ways.

Four types of interference were simulated. With interference of type A, a successful
treated trial that immediately follows a control trial has a response drawn from $F^\nu(\cdot)$, but
all unsuccessful trials and all treated trials that immediately follow other treated trials have
responses drawn from $F(\cdot)$. With interference of type B, a successful treated trial that
immediately follows a treated trial has a response drawn from $F^\nu(\cdot)$, but all unsuccessful
trials and all treated trials that immediately follow control trials have responses drawn
from $F(\cdot)$. Interference C and D resemble interference A, except that in C a successful
treated trial only has a response drawn from $F^\nu(\cdot)$ if it follows 2 or more control trials,
and in D if it follows 3 or more control trials.

Table 5: Simulated power with interference in a randomized experiment in a single block,
$B = 1$, of size $N = 250$, when 50% of trials are successful, $\lambda = 0.5$. The case $\nu = 1$ is
the null hypothesis of no effect and hence no interference among effects, so the simulation
is estimating the true size of a test with nominal level 0.05. The statistic k = 2 is the
Mann-Whitney-Wilcoxon statistic. The highest power in a non-null row is in bold.

$\lambda = 0.5$, $N = 250$

No Autoregressive Errors Added

$F(\cdot)$ is Normal
                                  t-test    k = 2     k = 5     k = 10
$\nu$ = 1, No effect              0.0434    0.0462    0.0432    0.0412
$\nu$ = 10, No interference       0.9992    0.9998    1.0000    0.9856
$\nu$ = 10, Interference A        0.8028    0.8014    0.8928    0.7328
$\nu$ = 10, Interference B        0.8006    0.7968    0.8810    0.7274
$\nu$ = 10, Interference C        0.3056    0.2830    0.3704    0.2806
$\nu$ = 10, Interference D        0.1174    0.1060    0.1238    0.0896

$F(\cdot)$ is the t-distribution, 2 df
                                  t-test    k = 2     k = 5     k = 10
$\nu$ = 1, No effect              0.0410    0.0448    0.0430    0.0436
$\nu$ = 10, No interference       0.9542    1.0000    1.0000    0.9854
$\nu$ = 10, Interference A        0.6610    0.8130    0.9004    0.7392
$\nu$ = 10, Interference B        0.6510    0.7998    0.8892    0.7316
$\nu$ = 10, Interference C        0.2464    0.2838    0.3652    0.2704
$\nu$ = 10, Interference D        0.0966    0.1188    0.1302    0.0984

Autoregressive Errors Added

$F(\cdot)$ is Normal
                                  t-test    k = 2     k = 5     k = 10
$\nu$ = 1, No effect              0.0494    0.0518    0.0458    0.0476
$\nu$ = 10, No interference       0.9714    0.9744    0.9454    0.7670
$\nu$ = 10, Interference A        0.4868    0.4824    0.4528    ...
$\nu$ = 10, Interference B        ...       0.4772    0.4572    0.3002
$\nu$ = 10, Interference C        0.1622    0.1562    0.1498    0.0892
$\nu$ = 10, Interference D        0.0786    0.0746    0.0726    0.0524

$F(\cdot)$ is the t-distribution, 2 df
                                  t-test    k = 2     k = 5     k = 10
$\nu$ = 1, No effect              0.0442    0.0524    0.0476    0.0476
$\nu$ = 10, No interference       0.9506    0.9968    0.9976    0.9670
$\nu$ = 10, Interference A        0.5826    0.6602    0.7494    0.6184
$\nu$ = 10, Interference B        0.5810    0.6534    0.7316    0.5926
$\nu$ = 10, Interference C        0.1996    0.2218    0.2502    0.1826
$\nu$ = 10, Interference D        0.0866    0.0928    0.0950    0.0704

Table 6: Simulated power with interference in a randomized experiment in a single block,
$B = 1$, of size $N = 1000$, when 10% of trials are successful, $\lambda = 0.1$. The case $\nu = 1$ is
the null hypothesis of no effect and hence no interference among effects, so the simulation
is estimating the true size of a test with nominal level 0.05. The statistic k = 2 is the
Mann-Whitney-Wilcoxon statistic.

$\lambda = 0.1$, $N = 1000$

[Table 6 entries are not reproduced; the layout parallels Table 5, with $\nu = 20$ in the non-null rows.]
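The data-generating process for one simulated block might be sketched as follows. This is an illustration under stated assumptions, not the authors' simulation code: $F$ is taken to be standard Normal, each trial is treated with probability 1/2, a draw from $F^\nu$ is obtained as the maximum of $\nu$ draws from $F$, and a trial with too few predecessors (for example the first trial under type A) is assumed not to respond. The function name and the returned `effective` flags are hypothetical conveniences.

```python
import random

# Immediately preceding control trials required for an effect, by type.
REQUIRED_CONTROLS = {"A": 1, "C": 2, "D": 3}

def simulate_block(n_trials, lam, nu, interference=None, seed=0):
    rng = random.Random(seed)

    def draw():
        return rng.gauss(0.0, 1.0)   # one draw from F (standard Normal)

    treated = [rng.random() < 0.5 for _ in range(n_trials)]
    responses, effective = [], []
    for i, z in enumerate(treated):
        # a treated trial is "successful" with probability lam (never labeled)
        responds = z and rng.random() < lam
        if responds and interference == "B":
            responds = i >= 1 and treated[i - 1]
        elif responds and interference in REQUIRED_CONTROLS:
            need = REQUIRED_CONTROLS[interference]
            responds = i >= need and not any(treated[i - need:i])
        if responds:
            r = max(draw() for _ in range(nu))   # a draw from F^nu
        else:
            r = draw()
        responses.append(r)
        effective.append(responds)
    return treated, responses, effective

treated, responses, effective = simulate_block(250, lam=0.5, nu=10,
                                               interference="A", seed=1)
print(sum(treated), sum(effective))
```

Only `treated` and `responses` would be available to a test statistic; `effective` is returned solely so the interference rule can be inspected.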

Interference between units creates one type of dependence over successive trials, but 
there can also be other types of dependence that are present in the absence of interference, 
indeed present in the absence of any treatment effect. The upper half of Tables 5 and 6 
is dependent over successive trials only due to interference. In the lower half of Tables 
5 and 6, the responses above are added to stationary autoregressive errors with standard 
Normal marginal distributions and autocorrelation 0.5. 
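Errors of the kind described can be generated with a first-order autoregression. The order of the autoregression is not stated above, so AR(1) is an assumption here, as are the function names; the innovation variance is scaled so that the marginal distribution stays standard Normal.

```python
import random
from math import sqrt

def ar1_errors(n, rho=0.5, seed=0):
    """Stationary AR(1) series with N(0, 1) marginals and lag-1 correlation rho."""
    rng = random.Random(seed)
    e = [rng.gauss(0.0, 1.0)]              # start in the stationary distribution
    scale = sqrt(1.0 - rho * rho)          # keeps the marginal variance at 1
    for _ in range(n - 1):
        e.append(rho * e[-1] + scale * rng.gauss(0.0, 1.0))
    return e

def lag1_autocorrelation(x):
    n = len(x)
    mean = sum(x) / n
    num = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

errors = ar1_errors(20000)
print(round(lag1_autocorrelation(errors), 2))
```

In the lower halves of the tables these errors would simply be added, trial by trial, to the responses generated by the mixture model.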

Each situation was simulated 5000 times, so the simulated power has a standard error
of at most $\sqrt{0.25/5000} = 0.007$.

5.2 Results of the simulation 

Tables 5 and 6 contrast the size and power of four test statistics, namely the conventional
pooled variance t-statistic and $T_Z$ for three values of k, k = 2, k = 5, and k = 10. Recall
that k = 2 corresponds with the Mann-Whitney-Wilcoxon statistic, and k = 5 is similar
to the suggestion of Salsburg (1986) and Conover and Salsburg (1988).

The case of $\nu = 1$ in Tables 5 and 6 is the null hypothesis: it suggests that all four tests
have size close to their nominal level of 0.05 in all sampling situations. This is expected
for $T_Z$ because it is a randomization test applied under the null hypothesis of no effect
in a randomized experiment. For related results about the randomization distribution
of statistics such as the t-statistic, see Welch (1937). Notice that, because this is a
randomization test in a randomized experiment, it has the correct level even in the case of
autocorrelated errors. In brief, because all four tests appear to be valid, falsely rejecting
true hypotheses at the nominal rate of 5%, it is reasonable to contrast the tests in terms
of power.

In the non-null cases, $\nu > 1$, the test with the highest power is in bold. No one
test is uniformly best in the situations considered in Tables 5 and 6, but the t-test and
the Mann-Whitney-Wilcoxon test are never much better than k = 5 and are often much
worse. The statistic with k = 10 performs well only in Table 6 where successful trials
occur only 10% of the time. The permutational t-statistic performs well only when both
$F(\cdot)$ and the autoregressive errors are Normal, and it performs poorly when $F(\cdot)$ is
the t-distribution with 2 df. When $F(\cdot)$ is Normal and there are no autoregressive errors,
the permutational t-statistic typically had less power than k = 5.

Tables 5 and 6 exhibit many patterns. It is not surprising that the addition of Gaussian 
autoregressive errors reduces power: the power in the top half of Tables 5 and 6 is typically 
quite a bit higher than the corresponding power in the lower half of the tables. Both types 
of interference, A and B, reduce power when compared with no interference, but in Tables 
5 and 6 interference A and B had similar effects on power. Interference patterns C and D 
reduce the number of responses that differ from control, so they reduce power relative to 
case A, but the suggestion of Conover and Salsburg, namely k = 5, exhibits decent relative 
performance in most if not all cases. 

5.3 Comparison with SPM 

A common approach to the analysis of fMRI data is the statistical parametric map (SPM) 
approach of Friston et al. (1995). Using responses convolved with a hemodynamic response 
function (HRF) as in Figure 2, the SPM approach entails testing a hypothesis of the equality 
of two regression coefficients in a generalized least squares analysis. We simulated this 
analysis with and without interference, with and without autoregressive errors, and with 
mixtures of successful and unsuccessful trials. The SPM approach uses a parametric 
model and is not a randomization inference, so there is no reason to expect that it will 
have the correct level when there is no treatment effect but the parametric model is false. 
Indeed, in nominal 0.05 level tests in our simulation, true null hypotheses of no effect were 
rejected more often than 5% of the time, in some cases with probabilities as high as 30%. 
In light of this, a power comparison is not appropriate. It is not a fault of the SPM 
method that it does not control the type one error rate when the null hypothesis of no 
effect is true but the model itself is false; that type of error control is not expected from 
standard parametric inference. Presumably, a careful user of the SPM approach would 
check for model failures using residuals and diagnostics, and alter the parametric model in 
appropriate ways. Nonetheless, it is convenient that the randomization inferences in §3 
do control the type one error rate at 5% in the presence of autocorrelation, interference 
between units, unsuccessful trials, and error distributions (such as the t-distribution with 2 
degrees of freedom) that lack a finite variance. 
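
The level claim can be checked on a small scale. The sketch below is an illustration only, not the simulation reported above. It generates responses with no treatment effect whose errors follow a first-order autoregression with t-distributed innovations on 2 degrees of freedom, assigns treatment completely at random, and records how often a Monte Carlo randomization test based on the difference in means rejects at the 0.05 level. The autoregressive coefficient 0.5, the sample sizes, the choice of test statistic, and all function names are assumptions made for illustration.

```python
import math
import random


def ar1_t2_errors(n, rho, rng):
    """AR(1) series whose innovations are t-distributed with 2 df.

    A t variate on 2 df is Z / sqrt(E), with Z standard Normal and E a
    unit exponential (a chi-square on 2 df divided by 2).
    """
    errors, prev = [], 0.0
    for _ in range(n):
        innovation = rng.gauss(0.0, 1.0) / math.sqrt(rng.expovariate(1.0))
        prev = rho * prev + innovation
        errors.append(prev)
    return errors


def mean_difference(values, treated_idx):
    """Treated-minus-control difference in means."""
    treated = set(treated_idx)
    t_sum = sum(values[i] for i in treated)
    c_sum = sum(values) - t_sum
    return t_sum / len(treated) - c_sum / (len(values) - len(treated))


def randomization_p_value(values, treated_idx, draws, rng):
    """One-sided Monte Carlo randomization p-value for the mean difference."""
    observed = mean_difference(values, treated_idx)
    m = len(treated_idx)
    exceed = sum(
        mean_difference(values, rng.sample(range(len(values)), m)) >= observed
        for _ in range(draws)
    )
    return (1 + exceed) / (1 + draws)


def null_rejection_rate(reps=200, n=60, m=30, rho=0.5,
                        draws=199, alpha=0.05, seed=1):
    """Proportion of rejections when the null hypothesis of no effect is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        values = ar1_t2_errors(n, rho, rng)   # responses unaffected by treatment
        treated_idx = rng.sample(range(n), m)  # complete randomization
        if randomization_p_value(values, treated_idx, draws, rng) <= alpha:
            rejections += 1
    return rejections / reps
```

Because the responses are held fixed and only the assignment is random, null_rejection_rate() should return a proportion near or below 0.05, up to simulation error, even though the errors are autocorrelated and lack a finite variance.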

5.4 Alternative designs and power 

The simulation has compared the power of different statistics in given situations with 
interference. Another potential source of increased power entails alternative experimental 
designs which alter the degree of interference by altering the time interval between trials. 
In the absence of interference, we generally expect more power with more trials, so naively 
we might expect increased power from ever more trials ever more rapidly paced. However, 
in cross-over designs, it is also commonly said that interference should be reduced by 
allowing time for a wash-out period between trials. In particular, it is possible that fewer 
trials with more time between them would yield less interference and fMRI activity that is 
more sharply distinct following treatment or control. If one were using a statistic that is 
valid only in the absence of interference, then the power in these two situations could not be 
compared, because a broader range of validity is being weighed against possibly reduced 
power. In contrast, using the randomization distribution of Tz to test H0, the test is 
valid, with correct level, for both rapid-fire designs with many trials and widely-spaced 
designs with fewer trials, and a comparison of power is possible. For instance, a smaller 
number of trials with more successful trials and less interference (N = 250, A = .5, case 
A in Table 3) yields greater power than more trials with fewer successful trials and more 
interference (N = 1000, A = .1, case D in Table 4), so it is clear that increasing the number 
of trials must be weighed against potential harms from increasing the pace at which trials 
are conducted. 

6 Summary 

Randomized experiments in cognitive neuroscience of the type described in §1.3 have three 
attributes that were important in the current discussion. First, with about 100 randomized 
stimuli for a single brain in a session of 600 seconds, interference is likely: the stimulus 
applied in one trial is likely to affect the response measured for other trials. Interference 
that is local in time is almost inevitable because the measurable response to one stimulus 
lasts for more than six seconds, but additionally as the trial progresses a subject is 
growing more familiar and experienced with the tasks and equipment, so interference may 
have a complex form that can extend across different sessions for the same subject. The 
use of the HRF in passing from Figure 1 to Figure 2 is a standard attempt to 
pick out the response to a particular stimulus, and useful though this is, it is at best an 
approximation. Second, with rapid-fire trials of this sort, not every trial will be successful 
in eliciting the intended cognitive activity. This is quite evident in the experiment in 
§1.3 because subjects responded inappropriately to some go or stop trials, but inattention, 
distraction or confusion can also occur without visible evidence. Some exposures to a 
stimulus stimulate the intended thought process, some don't. Third, because this is a 
randomized experiment, randomization can form the basis for inference, thereby avoiding 
assumptions of independence and non-interference. Within this context, we have proposed 
and illustrated a straightforward, robust methodology that (i) yields a confidence interval 
for the magnitude of effect despite interference between units, and (ii) often has greater 
power than procedures based on Wilcoxon's statistic when only some treated trials are 
successful. 